Analysis of Autonomous Systems

AA120Q: Building Trust in Autonomy, Stanford University.

Lecture 8

We will discuss a variety of methods for analyzing the behavior of autonomous systems.

Assignment:

  • Run simulations to characterize collision avoidance performance against a variety of metrics; develop methods for visualizing the decision making behavior of your system.


Measuring System Effectiveness

It is the designer's responsibility to go back to the field and assess the impact that the autonomous system is having. This measurement process must be both qualitative and quantitative.

Qualitative: Deals with the quality of a result. Does the policy followed by the agent look good? Is it behaving reasonably?

Quantitative: Objective values that can be quantified. For example, with ACAS X, one should look at operational data on airborne collisions, near-misses, and separation after ACAS X has been put into place.


Reward

Autonomous agents are often trained to maximize their reward. Does your agent receive high reward? That's all you care about, right?

Wrong.

We care about the performance of the system in the real world, and the real world is never perfectly modeled.


The Pareto Frontier

When optimizing a real-world system one often must balance a large number of trade-offs.

Which of the following is better?

  • an airborne collision avoidance system that has 1 collision and 1000 alerts per million flight hours

  • an airborne collision avoidance system that has 2 collisions and 10 alerts per million flight hours


Fewer collisions are good, and fewer alerts are good, but we cannot say which system is better without making a judgment about the relative value of collisions and alerts.

Suppose, then, that we have a set of policies, each plotted according to its collision rate and alert rate.


The potentially best policies are those that cannot be improved in one objective without being made worse in the other.


The Pareto Frontier is obtained by adjusting the tradeoff between your multiple objectives and optimizing models to trace out the curve.

The region closer to the origin than the Pareto Frontier is infeasible, whereas the region farther from the origin than the Pareto Frontier is suboptimal.
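The frontier test can be sketched as follows (in Python, with invented policy metrics; `pareto_frontier` is an illustrative name, not the lecture's code). A policy is on the frontier exactly when no other policy is at least as good in both objectives:

```python
# Identify Pareto-optimal policies from a set of (collisions, alerts)
# metrics per million flight hours. The numbers are made up for
# illustration; lower is better in both objectives.

def pareto_frontier(policies):
    """Return the policies not dominated by any other policy."""
    frontier = []
    for p in policies:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p
                        for q in policies)
        if not dominated:
            frontier.append(p)
    return sorted(frontier)

policies = [(1, 1000), (2, 10), (3, 500), (2, 800), (5, 5)]
print(pareto_frontier(policies))  # → [(1, 1000), (2, 10), (5, 5)]
```

The dominated policies (3, 500) and (2, 800) fall in the suboptimal region behind the frontier.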


Given a Pareto Frontier, how do we choose the best policy?

This is often a subjective question, and often requires the careful consideration of factors that are not in your optimization objective. Domain experts are often consulted.


Inspect the Decision Making Behavior

Has your agent really learned to do what it was designed to do?

If you have trained a neural network to recognize cats, how do you know whether the neural network has really learned what a cat is?

A well-known example comes from optimizing the input of a neural network trained to recognize dumbbells. It turns out that the net sees dumbbells as dumbbells with forearms attached.


The Black Swan Problem

This problem is known as the Black Swan Problem. The name comes from the black swans of Australia and New Zealand, and the faulty induction a European observer might make:

All swans I have seen are white, therefore all swans are white

Of course, once said European travels to southern Australia and sees a black swan, they can either change their belief or forever categorize the black swan as an entirely different species.

With autonomous agents we want to make sure that they identify the correct categories. It is often a non-trivial problem.

For an Autonomous Car, are these Pedestrians?


Sure looks like a pedestrian.

Also a pedestrian, but this one also has a bike. Maybe our definition should be "a person walking across the street".

Whoops! That didn't work. Hmm. Harder than we thought!


How to Get Past the Black Swan Problem

The Black Swan problem is a fundamental problem in artificial intelligence and machine learning. The best way around it is to have as large and comprehensive a dataset as possible and to test on as many corner cases as possible. Visualize your agent's decision-making process!


Cross Validation

In this class you are trying to optimize an airborne collision avoidance system. You have been given a dataset of encounters. How do you go about tuning your model parameters for maximum performance?

  • Maximize Performance on the Given Dataset

This option is very tempting. You simply optimize the system to maximize the reward when run in simulation on the training set. What could go wrong?

Let us provide an illustrative example: adjusting the number of histogram bins to get the best fit for a distribution.

Consider the following true distribution:

true_dist
MixtureModel{Distributions.Normal{Float64}}(K = 3)
components[1] (prior = 0.1250): Distributions.Normal{Float64}(μ=0.0, σ=0.3)
components[2] (prior = 0.1250): Distributions.Normal{Float64}(μ=-1.0, σ=0.3)
components[3] (prior = 0.7500): Distributions.Normal{Float64}(μ=0.0, σ=1.0)

You want to get the best model for this distribution. The problem is, all you have to train on is a finite sample dataset drawn from it.


Suppose we want to model it with a piecewise uniform distribution with even bin widths, say 20 bins. How do we select the best number of bins?


Clearly the correct number is somewhere between 5 and 100, but what should we use? Remember, we don't have access to the true distribution.

One approach to use is to do a train-test split.

This involves taking the available data, using some of it to fit the distribution, and holding out the rest to check the fit.


We can then use some metric, perhaps the likelihood of the test data under the learned model, to select the preferred nbins.

This approach tests the ability of your model to generalize to unseen data.
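A minimal sketch of the procedure (in Python with only the standard library; the synthetic data, bin range, and function names are assumptions, not the lecture's code). We fit a piecewise-uniform density on a training split and score a held-out test split by its likelihood:

```python
import random

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(500)]  # stand-in dataset

random.shuffle(samples)
split = int(0.7 * len(samples))
train, test = samples[:split], samples[split:]

def fit_hist(data, nbins, lo=-4.0, hi=4.0):
    """Piecewise-uniform fit: per-bin densities over [lo, hi]."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for x in data:
        i = min(nbins - 1, max(0, int((x - lo) / width)))  # clamp to range
        counts[i] += 1
    return [c / (len(data) * width) for c in counts]

def likelihood(data, density, lo=-4.0, hi=4.0):
    """Product of per-sample densities under the fitted model."""
    width = (hi - lo) / len(density)
    L = 1.0
    for x in data:
        L *= density[min(len(density) - 1, max(0, int((x - lo) / width)))]
    return L

for nbins in (5, 20, 100):
    print(nbins, likelihood(test, fit_hist(train, nbins)))
```

With many bins and limited training data, some test samples can land in empty bins, collapsing the test likelihood to zero.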


Two things:

  1. Notice how small the likelihoods are? Maximizing the log-likelihood gives the same result but with better-behaved numbers.

  2. Notice those zeros? Those occur whenever a test sample falls in a bin that received no training samples, making its likelihood zero. We can add a prior of one count to each bin to ensure that every bin has some support. This is also called Laplace smoothing.
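Both fixes can be sketched together (again in Python, with assumed names and bin range): one pseudo-count per bin keeps every density positive, and summing logs replaces the vanishingly small product.

```python
import math

def fit_hist_smoothed(data, nbins, lo=-4.0, hi=4.0):
    """Piecewise-uniform fit with one pseudo-count per bin (Laplace smoothing)."""
    width = (hi - lo) / nbins
    counts = [1] * nbins  # the prior: one count in every bin
    for x in data:
        i = min(nbins - 1, max(0, int((x - lo) / width)))
        counts[i] += 1
    total = sum(counts)
    return [c / (total * width) for c in counts]

def loglikelihood(data, density, lo=-4.0, hi=4.0):
    """Sum of log densities; finite as long as every bin has support."""
    width = (hi - lo) / len(density)
    idx = lambda x: min(len(density) - 1, max(0, int((x - lo) / width)))
    return sum(math.log(density[idx(x)]) for x in data)

density = fit_hist_smoothed([0.2, 0.3, -1.1], 20)
print(loglikelihood([3.9], density))  # finite even for a sample in an empty bin
```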


Train-test splitting is pretty good, but we can do even better if we do this over multiple train-test splits. Cross validation is one common way of doing this.

In k-fold cross validation, you take your training data and divide it into k even chunks. For each chunk, you

  • train on all of the $k - 1$ other chunks

  • test on the chunk

And then take the average over all k validation scores to get your cross-validated score.
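The fold mechanics can be sketched as follows (Python; `kfold_score` and the toy scorer are illustrative names, and this version assumes the data size is divisible by k):

```python
def kfold_score(data, k, score):
    """Average score over k folds, each held out once as the test set."""
    n = len(data)
    fold_size = n // k
    total = 0.0
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]      # held-out chunk
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        total += score(train, test)
    return total / k

# Toy scorer: mean squared distance of test points from the training mean.
mse = lambda tr, te: sum((x - sum(tr) / len(tr)) ** 2 for x in te) / len(te)
print(kfold_score(list(range(10)), 5, mse))  # → 12.75
```

In the assignment, `score` would instead run your collision avoidance system on the held-out encounters and return its penalty.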


Measuring the 'Closeness' of Distributions

One often needs to measure the closeness of two distributions, such as when comparing distributions over emergent metrics to their real-world counterparts.

For example, if you create a car that is supposed to drive like a human, does it tend to drive with the same following distance as a real car does?

We can all agree that the following two distributions are close.

real
Distributions.Normal{Float64}(μ=0.0, σ=1.0)

We can all agree that the following two distributions are less close.


How close are these distributions?

sim
MixtureModel{Distributions.Cauchy{Float64}}(K = 5)
components[1] (prior = 0.1000): Distributions.Cauchy{Float64}(μ=-5.0, σ=1.8)
components[2] (prior = 0.4000): Distributions.Cauchy{Float64}(μ=-4.0, σ=0.8)
components[3] (prior = 0.1500): Distributions.Cauchy{Float64}(μ=-1.0, σ=0.3)
components[4] (prior = 0.2000): Distributions.Cauchy{Float64}(μ=2.0, σ=0.8)
components[5] (prior = 0.1500): Distributions.Cauchy{Float64}(μ=4.0, σ=1.5)

The Kullback–Leibler (KL) divergence is one way to measure closeness. It actually measures divergence, the opposite of closeness.

$D_{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$

It is 0 if the two distributions are the same and increases if they are different.

Note that it is non-symmetric, so $D_{KL}(p \,\|\, q) \neq D_{KL}(q \,\|\, p)$. Here, $p$ is the true distribution and $q$ is what is being used to approximate $p$.


The KL divergence for two Gaussian distributions $p = \mathcal{N}(\mu_1, \sigma_1)$ and $q = \mathcal{N}(\mu_2, \sigma_2)$ is:

$D_{KL}(p \,\|\, q) = \log\left(\frac{\sigma_2}{\sigma_1}\right) + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$
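A quick sanity check of this closed form, sketched in Python (`kl_gaussian` is an illustrative name):

```python
import math

# Closed-form KL divergence between two univariate Gaussians,
# following the formula above.

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

print(kl_gaussian(0.0, 1.0, 0.0, 1.0))  # identical distributions → 0.0
print(kl_gaussian(0.0, 1.0, 0.0, 2.0))  # ≈ 0.318
print(kl_gaussian(0.0, 2.0, 0.0, 1.0))  # ≈ 0.807 — the divergence is non-symmetric
```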


Assignment

Your homework is to:

  • Run your collision avoidance system on the training set and pick three encounters where the system fails. Why does it fail? Is this a problem with your design? How might it be fixed?

  • Run your collision avoidance system on the training set and compute:

    • the number of NMACs

    • the number of NMACs in which no advisory was issued

    • a histogram over the aircraft separation distance when advisories were issued

    • a scatter plot of relative horizontal separation vs. relative vertical separation when advisories were issued

    • a scatter plot of the intruder bearing (the clockwise bearing from your craft to the intruder at the encounter start) vs. the intruder heading (clockwise relative to the positive north axis; see Lecture 4) for when advisories were issued

  • Choose a meaningful tunable parameter in your collision avoidance system, or change your system to include one (for example, a parameter in the Alpha-Beta filter). Use 5-fold cross validation over the training dataset and plot the cross-validated normalized penalty with respect to the tunable parameter.

Turn in your code and writeup (preferably a single Julia Notebook) to Canvas.
